AITopics | negative image


negative image

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question "What is Artificial Intelligence?" and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

b8b93c48f5bfa385d071342089d70422-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems, Apr-30-2026, 01:20:31 GMT

caption, large language model, machine learning, (22 more...)

Neural Information Processing Systems

Country: Europe (0.93)

Genre:

Overview (0.68)
Research Report > New Finding (0.46)

Industry:

Information Technology (0.68)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Vision (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)


BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval

Neural Information Processing Systems, Feb-17-2026, 17:46:41 GMT

Compositionality is still a challenging problem.

caption, large language model, machine learning, (22 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Spain > Basque Country (0.04)

Genre:

Overview (0.68)
Research Report > New Finding (0.46)

Industry:

Information Technology (0.68)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Vision (0.94)
Information Technology > Sensing and Signal Processing > Image Processing (0.84)
(2 more...)


Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization

Li, Xintong, Wang, Chuhan, Wu, Junda, Surana, Rohan, Yu, Tong, McAuley, Julian, Shang, Jingbo

arXiv.org Artificial Intelligence, Oct-1-2025

Direct Preference Optimization (DPO) has recently been extended from text-only models to vision-language models. However, existing methods rely on oversimplified pairwise comparisons, generating a single negative image via basic perturbations or similarity-based retrieval, which fail to capture the complex nature of multimodal preferences, inducing optimization bias and hallucinations. To address this issue, we propose MISP-DPO, the first framework to incorporate multiple, semantically diverse negative images in multimodal DPO via the Plackett-Luce model. Our method embeds prompts and candidate images in CLIP (Contrastive Language-Image Pretraining) space and applies a sparse autoencoder to uncover semantic deviations into interpretable factors. Negative samples are selected based on reconstruction difficulty, semantic deviation from the positive, and mutual diversity, yielding broader and more informative supervision. To handle multi-negative comparisons, we adopt a Plackett-Luce objective and introduce an importance sampling strategy that improves training efficiency. Experiments across five diverse benchmarks demonstrate that MISP-DPO consistently improves multimodal alignment over prior methods, validating the effectiveness of semantic-aware, multi-negative sampling in preference-based learning.
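The multi-negative comparison described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each candidate image already has a scalar DPO-style implicit reward, and it keeps only the first-place term of a Plackett-Luce likelihood (the probability that the positive outranks all negatives). The function names `implicit_reward` and `plackett_luce_top1_loss` and the default `beta` are hypothetical, and the paper's importance sampling and negative selection steps are omitted.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO-style implicit reward: scaled log-ratio of policy to reference.

    beta=0.1 is an illustrative default, not a value taken from the paper.
    """
    return beta * (logp_policy - logp_ref)


def plackett_luce_top1_loss(pos_reward, neg_rewards):
    """Negative log-probability that the positive is ranked first.

    Under a Plackett-Luce model the first-place probability is a softmax
    over all candidate rewards, so the loss is log-sum-exp minus the
    positive's reward.
    """
    scores = [pos_reward] + list(neg_rewards)
    m = max(scores)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - pos_reward


# With a single negative this reduces to the pairwise DPO loss,
# -log(sigmoid(pos_reward - neg_reward)).
pairwise = plackett_luce_top1_loss(1.0, [0.0])
multi = plackett_luce_top1_loss(1.0, [0.0, 0.5, -0.2])
```

Adding negatives can only increase the loss for a fixed positive reward, which is one way to read the abstract's claim that several diverse negatives give broader supervision than a single pairwise comparison.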

arxiv preprint arxiv, large language model, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2509.25717

Country: Europe > Switzerland (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)


Yo'Chameleon: Personalized Vision and Language Generation

Nguyen, Thao, Singh, Krishna Kumar, Shi, Jing, Bui, Trung, Lee, Yong Jae, Li, Yuheng

arXiv.org Artificial Intelligence, Apr-30-2025

Large Multimodal Models (e.g., GPT-4, Gemini, Chameleon) have evolved into powerful tools with millions of users. However, they remain generic models and lack personalized knowledge of specific user concepts. Previous work has explored personalization for text generation, yet it remains unclear how these methods can be adapted to new modalities, such as image generation. In this paper, we introduce Yo'Chameleon, the first attempt to study personalization for large multimodal models. Given 3-5 images of a particular concept, Yo'Chameleon leverages soft-prompt tuning to embed subject-specific information to (i) answer questions about the subject and (ii) recreate pixel-level details to produce images of the subject in new contexts. Yo'Chameleon is trained with (i) a self-prompting optimization mechanism to balance performance across multiple modalities, and (ii) a "soft-positive" image generation approach to enhance image quality in a few-shot setting.

large language model, machine learning, natural language, (23 more...)

arXiv.org Artificial Intelligence

2504.20998

Country:

North America > United States (0.28)
Asia (0.28)

Genre:

Research Report > New Finding (0.68)
Research Report > Promising Solution (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)


Progressive Compositionality In Text-to-Image Generative Models

Han, Xu, Jin, Linghao, Liu, Xiaofeng, Liang, Paul Pu

arXiv.org Artificial Intelligence, Oct-22-2024

Despite the impressive text-to-image (T2I) synthesis capabilities of diffusion models, they often struggle to understand compositional relationships between objects and attributes, especially in complex settings. Existing solutions have tackled these challenges by optimizing the cross-attention mechanism or learning from the caption pairs with minimal semantic changes. However, can we generate high-quality complex contrastive images that diffusion models can directly discriminate based on visual representations? These pairs feature minimal visual discrepancies and cover a wide range of attribute categories, especially complex and natural scenarios. To learn effectively from these error cases, i.e., hard negative images, we propose E [...] Through extensive experiments across a wide range of compositional scenarios, we showcase the effectiveness of our proposed framework on compositional T2I benchmarks.

The rapid advancement of text-to-image generative models (Saharia et al., 2022; Ramesh et al., 2022) has revolutionized the field of image synthesis, driving significant progress in various applications such as image editing (Brooks et al., 2023; Zhang et al., 2024), video generation (Brooks et al., 2024) and medical imaging (Han et al., 2024a). Common issues include incorrect attribute binding, miscounting, and flawed object relationships as shown in Figure 1. For example, when given the prompt "a red motorcycle and a yellow door", the model might incorrectly bind the colors to the objects, resulting in a yellow motorcycle. Recent progress focuses on optimizing the attention mechanism within diffusion models to better capture the semantic information conveyed by input text prompts (Agarwal et al., 2023; Chefer et al., 2023; Pandey et al., 2023).

artificial intelligence, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2410.16719

Country:

North America > United States > California (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report (0.85)

Industry:

Leisure & Entertainment > Sports (0.68)
Health & Medicine > Diagnostic Medicine > Imaging (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.66)


FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension

Liu, Junzhuo, Yang, Xuzheng, Li, Weiwei, Wang, Peng

arXiv.org Artificial Intelligence, Sep-23-2024

Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding. Consequently, it serves as an ideal testing ground for Multi-modal Large Language Models (MLLMs). In pursuit of this goal, we have established a new REC dataset characterized by two key features: Firstly, it is designed with controllable varying levels of difficulty, necessitating multi-level fine-grained reasoning across object categories, attributes, and multi-hop relationships. Secondly, it includes negative text and images created through fine-grained editing and generation based on existing data, thereby testing the model's ability to correctly reject scenarios where the target object is not visible in the image--an essential aspect often overlooked in existing datasets and approaches. Utilizing this high-quality dataset, we conducted comprehensive evaluations of both state-of-the-art specialist models and MLLMs. Our findings indicate that there remains a significant gap in achieving satisfactory grounding performance. We anticipate that our dataset will inspire new approaches to enhance visual reasoning and develop more advanced cross-modal interaction strategies, ultimately unlocking the full potential of MLLMs. Our code and the datasets are available at https://github.com/liujunzhuo/FineCops-Ref.

dataset, expression, negative sample, (14 more...)

arXiv.org Artificial Intelligence

2409.14750

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(6 more...)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)


BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval

Miranda, Imanol, Salaberria, Ander, Agirre, Eneko, Azkune, Gorka

arXiv.org Artificial Intelligence, Jun-14-2024

Existing Vision-Language Compositionality (VLC) benchmarks like SugarCrepe are formulated as image-to-text retrieval problems, where, given an image, the models need to select between the correct textual description and a synthetic hard negative text. In this work we present the Bidirectional Vision-Language Compositionality (BiVLC) dataset. The novelty of BiVLC is to add a synthetic hard negative image generated from the synthetic text, resulting in two image-to-text retrieval examples (one for each image) and, more importantly, two text-to-image retrieval examples (one for each text). Human annotators filter out ill-formed examples ensuring the validity of the benchmark. The experiments on BiVLC uncover a weakness of current multimodal models, as they perform poorly in the text-to-image direction. In fact, when considering both retrieval directions, the conclusions obtained in previous works change significantly. In addition to the benchmark, we show that a contrastive model trained using synthetic images and texts improves the state of the art in SugarCrepe and in BiVLC for both retrieval directions. The gap to human performance in BiVLC confirms that Vision-Language Compositionality is still a challenging problem. BiVLC and code are available at https://imirandam.github.io/BiVLC_project_page.
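The bidirectional setup in this abstract (two images, two texts, and one retrieval example per image and per text) can be sketched as a scoring function over a 2x2 similarity matrix. This is an illustrative sketch only: the function name `bidirectional_scores`, the strict-inequality rule, and the way the four examples are aggregated into per-direction and group flags are assumptions, not the benchmark's official evaluation code.

```python
def bidirectional_scores(sim):
    """Score one instance from a 2x2 image-text similarity matrix.

    sim[i][t] is the model's similarity between image i and text t.
    Index 0 is the original image/caption pair; index 1 is the synthetic
    hard-negative text and the image generated from it.
    """
    # Image-to-text: for each image, its own text must score higher.
    i2t = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    # Text-to-image: for each text, its own image must score higher.
    t2i = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    return {"i2t": i2t, "t2i": t2i, "group": i2t and t2i}


# Hypothetical similarities: both images pick the right text, but the
# original caption is pulled toward the synthetic image.
example = bidirectional_scores([[0.80, 0.50],
                                [0.90, 0.95]])
```

In this made-up example the instance passes image-to-text retrieval but fails text-to-image retrieval, the kind of asymmetry the abstract reports for current multimodal models.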

caption, dataset, retrieval, (16 more...)

arXiv.org Artificial Intelligence

2406.09952

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Spain > Basque Country (0.04)

Genre:

Research Report (0.82)
Overview (0.68)

Industry:

Information Technology (0.46)
Government (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)


CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples

Zhang, Jianrui, Cai, Mu, Xie, Tengyang, Lee, Yong Jae

arXiv.org Artificial Intelligence, Jun-12-2024

We propose CounterCurate, a framework to comprehensively improve the visio-linguistic compositional reasoning capability for both contrastive and generative multimodal models. In particular, we identify two critical under-explored problems: the neglect of the physically grounded reasoning (counting and position understanding) and the potential of using highly capable text and image generation models for semantic counterfactual fine-tuning. Our work pioneers an approach that addresses these gaps. We first spotlight the near-chance performance of multimodal models like CLIP and LLaVA in physically grounded compositional reasoning. We then apply simple data augmentation using grounded image generation model GLIGEN to generate fine-tuning data, resulting in significant performance improvements: +33% and +37% for CLIP and LLaVA, respectively, on our newly curated Flickr30k-Positions benchmark. Moreover, we exploit the capabilities of high-performing text generation and image generation models, specifically GPT-4V and DALLE-3, to curate challenging semantic counterfactuals, thereby further enhancing compositional reasoning capabilities on benchmarks such as SugarCrepe, where CounterCurate outperforms GPT-4V. To facilitate future research, we release our code, dataset, benchmark, and checkpoints at https://countercurate.github.io.

caption, countercurate, reasoning, (16 more...)

arXiv.org Artificial Intelligence

2402.13254

Country: North America > United States > Wisconsin > Dane County > Madison (0.04)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)


What's the Opposite of a Face? Finding Shared Decodable Concepts and their Negations in the Brain

Efird, Cory, Murphy, Alex, Zylberberg, Joel, Fyshe, Alona

arXiv.org Artificial Intelligence, May-27-2024

Prior work has offered evidence for functional localization in the brain; different anatomical regions preferentially activate for certain types of visual input. For example, the fusiform face area preferentially activates for visual stimuli that include a face. However, the spectrum of visual semantics is extensive, and only a few semantically-tuned patches of cortex have so far been identified in the human brain. Using a multimodal (natural language and image) neural network architecture (CLIP) we train a highly accurate contrastive model that maps brain responses during naturalistic image viewing to CLIP embeddings. We then use a novel adaptation of the DBSCAN clustering algorithm to cluster the parameters of these participant-specific contrastive models. This reveals what we call Shared Decodable Concepts (SDCs): clusters in CLIP space that are decodable from common sets of voxels across multiple participants. Examining the images most and least associated with each SDC cluster gives us additional insight into the semantic properties of each SDC. We note SDCs for previously reported visual features (e.g. orientation tuning in early visual cortex) as well as visual semantic concepts such as faces, places and bodies. In cases where our method finds multiple clusters for a visuo-semantic concept, the least associated images allow us to dissociate between confounding factors. For example, we discovered two clusters of food images, one driven by color, the other by shape. We also uncover previously unreported areas such as regions of extrastriate body area (EBA) tuned for legs/hands and sensitivity to numerosity in right intraparietal sulcus, and more. Thus, our contrastive-learning methodology better characterizes new and existing visuo-semantic representations in the brain by leveraging multimodal neural network representations and a novel adaptation of clustering algorithms.

negative image, participant, representation, (14 more...)

arXiv.org Artificial Intelligence

2405.17663

Country: North America > Canada > Alberta (0.14)

Genre: Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
(2 more...)

